fix(ci): increase apt retry timeout to prevent kill EPERM crash #29715

Merged: alucardzom merged 2 commits into main from ale/infra-3580-apt-retry-timeout-fix on May 7, 2026

Conversation

alucardzom commented May 5, 2026

Description

Problem

Android E2E runs are crashing with Error: kill EPERM in the "Set up E2E environment" step (example). This was introduced by #29236, which wrapped sudo apt-get inside nick-fields/retry with timeout_minutes: 3.

Root cause: When the 3-minute timeout fires, nick-fields/retry calls process.kill() on the child process. But sudo apt-get runs as root while the Cirrus runner process runs as admin, so the kill syscall is refused and Node.js throws EPERM (operation not permitted). This is a known upstream bug (nick-fields/retry#124; open since Oct 2023, 11 upvotes, unpatched).

Why the timeout fires: DPkg::Lock::Timeout=120 means apt-get can legitimately wait up to 120s for the dpkg lock on each of the two sudo calls (update + install). Two full lock waits alone add up to 240s, and even a single lock wait plus a slow Ubuntu mirror can approach 180s (3 min), so the timeout can fire while apt-get is still behaving normally. The 3-minute value was tightened from the original 5-minute design in a follow-up commit on #29236, which didn't account for the double lock-wait scenario.

Fix

  1. Restore timeout_minutes from 3 to 5, giving 300s per attempt. Even the worst case considered (120s lock on update + 120s lock on install + 30s actual install = 270s) fits with 30s of headroom. apt-get resolves on its own (success or dpkg lock timeout error) before the retry timeout fires, so the process.kill() path, and with it the EPERM bug, is not reached.

  2. Add retry_on: error — only retry when apt-get exits with a non-zero code (mirror desync, lock timeout), not when nick-fields/retry's own timeout fires. A timeout-triggered retry would crash with EPERM anyway, so this avoids a wasted attempt.
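The effect of retry_on: error can be pictured with a small Python model. This is only a sketch of the behaviour described above, not the action's actual code: `Failure` and `should_retry` are hypothetical names introduced for illustration, and the accepted values (`any`, `error`, `timeout`) are assumed from the action's documented `retry_on` input.

```python
from enum import Enum


class Failure(Enum):
    """Why an attempt failed (hypothetical names, for illustration only)."""
    ERROR = "error"      # the wrapped command exited with a non-zero code
    TIMEOUT = "timeout"  # the retry wrapper's own timeout fired


def should_retry(retry_on: str, failure: Failure) -> bool:
    """Return True if a failed attempt would be retried under `retry_on`."""
    if retry_on not in ("any", "error", "timeout"):
        raise ValueError(f"unsupported retry_on value: {retry_on!r}")
    # "any" retries on every failure; otherwise the failure kind must match.
    return retry_on == "any" or retry_on == failure.value


if __name__ == "__main__":
    for kind in Failure:
        print(f"retry_on=error, failure={kind.value}: retry={should_retry('error', kind)}")
```

Under this model, a non-zero apt-get exit (mirror desync, lock timeout) is retried, while a wrapper timeout ends the step instead of starting another attempt that would hit the same kill path.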

Timing analysis

| Scenario | Duration | Fits in 5 min? |
| --- | --- | --- |
| Happy path (no lock, fast mirror) | 5-15s | Yes (285-295s margin) |
| Lock on one call + normal mirror | 120s + 15s = 135s | Yes (165s margin) |
| Lock on both calls + slow mirror | 240s + 30s = 270s | Yes (30s margin) |
| Lock on both + very slow mirror | 240s + 60s = 300s | Boundary, but this is extremely unlikely |
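The margins in the timing analysis follow from simple arithmetic, which the Python sketch below reproduces. It assumes only what the description states (a 120s DPkg::Lock::Timeout and two sudo calls per attempt); the function and constant names are hypothetical.

```python
LOCK_TIMEOUT_S = 120  # DPkg::Lock::Timeout=120, as stated in the description
SUDO_CALLS = 2        # apt-get update + apt-get install


def attempt_duration_s(locked_calls: int, work_s: int) -> int:
    """Upper bound for one attempt: full lock waits plus actual apt-get work."""
    if not 0 <= locked_calls <= SUDO_CALLS:
        raise ValueError("locked_calls must be between 0 and 2")
    return locked_calls * LOCK_TIMEOUT_S + work_s


def margin_s(timeout_minutes: int, locked_calls: int, work_s: int) -> int:
    """Seconds left before the retry wrapper's timeout; negative means it fires."""
    return timeout_minutes * 60 - attempt_duration_s(locked_calls, work_s)


if __name__ == "__main__":
    scenarios = [
        ("lock on one call + normal mirror", 1, 15),
        ("lock on both calls + slow mirror", 2, 30),
        ("lock on both + very slow mirror", 2, 60),
    ]
    for label, locked, work in scenarios:
        for minutes in (3, 5):
            print(f"{label}, timeout {minutes} min: margin {margin_s(minutes, locked, work)}s")
```

A negative margin at 3 minutes for the double-lock rows corresponds to the case where the wrapper's timeout fires and the kill path is reached.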

Changelog

CHANGELOG entry: null

Related issues

Refs: INFRA-3580
Fixes regression from #29236

Manual testing steps

N/A — CI infrastructure fix. Validated by any Android E2E workflow run. The timeout increase is transparent in the happy path (apt-get takes 5-15s).

Screenshots/Recordings

Before

N/A

After

N/A

Pre-merge author checklist

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and or screenshots.

Made with Cursor


Note

Low Risk
Low risk CI-only change that adjusts retry behavior for Linux apt-get during Android E2E setup; primary impact is longer waits before failing and fewer timeout-triggered retries.

Overview
Reduces flaky Android E2E setup failures by updating the setup-e2e-env composite action to increase the nick-fields/retry apt-get wrapper timeout from 3 to 5 minutes.

The retry wrapper is also configured with retry_on: error so retries only happen on non-zero exits, avoiding retries triggered by the action's own timeout.

Reviewed by Cursor Bugbot for commit 8bcc995. Bugbot is set up for automated code reviews on this repo.

alucardzom self-assigned this May 5, 2026
metamaskbotv2 (bot) added the team-dev-ops label May 5, 2026
alucardzom added the no-changelog and team-mobile-platform labels May 5, 2026
github-actions (bot) commented May 5, 2026

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

alucardzom added the skip-smart-e2e-selection label May 5, 2026
alucardzom marked this pull request as ready for review May 5, 2026 08:29
alucardzom requested a review from a team as a code owner May 5, 2026 08:29
alucardzom and others added 2 commits May 6, 2026 08:46
Restore timeout_minutes from 3 to 5 for the apt-get retry step, and
add retry_on: error to only retry on command failure, not timeout.

The 3-minute timeout was too tight: DPkg::Lock::Timeout=120 means
apt-get can legitimately take up to 120s waiting for the dpkg lock
on each of the two sudo calls (update + install). With slow mirrors
on top, total time can approach or exceed 180s, triggering the
nick-fields/retry timeout. When the timeout fires, the action tries
to process.kill() the sudo child process, but since sudo runs as
root and the runner runs as admin, Node.js gets EPERM — a known
upstream bug (nick-fields/retry#124). The action crashes instead of
retrying.

With timeout_minutes: 5 (300s), even worst-case lock wait (240s) +
slow install (30s) = 270s fits with 30s headroom. apt-get resolves
on its own before the timeout fires, so the kill path is never hit.

Adding retry_on: error ensures retries only happen on actual apt
failures (mirror desync, lock timeout), not on the retry-action's
own timeout — which would crash with EPERM anyway.

Co-authored-by: Cursor <cursoragent@cursor.com>
alucardzom force-pushed the ale/infra-3580-apt-retry-timeout-fix branch from 7274995 to 8bcc995 May 6, 2026 06:47
github-project-automation (bot) moved this to Needs dev review in PR review queue May 6, 2026
github-project-automation (bot) moved this from Needs dev review to Review finalised - Ready to be merged in PR review queue May 6, 2026
sonarqubecloud Bot commented May 6, 2026

alucardzom added this pull request to the merge queue May 7, 2026
Merged via the queue into main with commit 2c7c3e8 May 7, 2026
60 checks passed
alucardzom deleted the ale/infra-3580-apt-retry-timeout-fix branch May 7, 2026 08:37
github-actions (bot) locked and limited conversation to collaborators May 7, 2026
metamaskbotv2 (bot) added the release-7.77.0 label May 7, 2026
Labels

  • no-changelog: Indicates no external facing user changes, therefore no changelog documentation needed
  • release-7.77.0: Issue or pull request that will be included in release 7.77.0
  • size-XS
  • skip-smart-e2e-selection: Skip Smart E2E selection, i.e. select all E2E tests to run
  • team-dev-ops: DevOps team
  • team-mobile-platform: Mobile Platform team

Projects

Archived in project

3 participants